-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Large speedup in countrycode()
#341
Large speedup in countrycode()
#341
Conversation
Oooh, this looks awesome, thanks! Frankly, it's a bit scary too, so I'll want to play with it a bit before merging. Unfortunately, it's the start of semester, so I may not get to this for a little while. If |
If @cjyetman has time and interest (no pressure), he could also merge this after some testing. |
Yup, completely agree, the last thing you want is realize later that codes and names don't match. I'll see later if I can test more extensively |
FYI an improvement that makes the code much simpler was suggested to me on SO, I'll modify the PR |
New timing after the simplification: #> # A tibble: 1 × 3
#> pkg median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 fork 1.18s 742MB |
@vincentarelbundock or @cjyetman can you enable the CI for all commits in the PR so that I know if they pass? |
I cannot (enable CI on commits from forks), but I have approved the CI to run on your most recent push. |
For information, if you're ok with adding a dependency on |
Got it. I think it's best to stay dependency-free, though |
Looks like all the tests pass. I guess that's better than whatever impressionistic sense I could get from playing with different things. What do you think @cjyetman , should I just merge this? |
I'll try to take a look at it today. |
I ran |
@etiennebacher can you explain what |
It allows us to run the same code automatically with two (or more) versions of the same package. I specified the |
This seems to work as expected, tests pass, build passes, so that seems ok. I can't exactly replicate the example because I don't know where A few suggestions:
|
Thanks a ton for the review @cjyetman, I appreciate your time. I also agree with the comments about white space and separate PRs for different concepts. That said, it's probably fine to merge this now in order to avoid more busywork for Etienne. So @etiennebacher, if you want to add your name as a contributor (your choice), I can merge the PR. A 5x speedup is nothing to scoff at. Very nice! Thanks! |
Thanks for the link, I was not finding {cross} on CRAN. |
Note that if multiple destinations are used, the speed-up library(bench)
out <- cross::run(
pkgs = c("vincentarelbundock/countrycode",
"etiennebacher/countrycode@large-speedup"),
~{
library(countrycode)
sourcevar <- sample(codelist$country.name.en, 1e7, TRUE)
bench::mark(
countrycode(sourcevar, "country.name", c("iso3c", "iso3n")),
iterations = 10
)
}
)
tidyr::unnest(out, result) |>
dplyr::select(pkg, median, mem_alloc) |>
dplyr::mutate(pkg = ifelse(grepl("vincent", pkg), "main", "fork"))
#> # A tibble: 2 × 3
#> pkg median mem_alloc
#> <chr> <bch:tm> <bch:byt>
#> 1 main 5.95s 2.45GB
#> 2 fork 4.56s 2.45GB |
Thanks to both of you (and especially Etienne!). Merged! |
Refactor of the way we search for regex matches on each input. Instead of running onegrepl()
per regex, I collapse all regexes into a single one by putting them into capture groups. Then, a single run ofgregexpr()
is needed and we can extract the matches from the information on the capture groups. I put a bit more explanations in the code, but you can also refer to this StackOverflow post.Instead of using nested
sapply()
, we can pass the whole vector of inputs (country names, iso, etc.) togrepl()
directly.There was also one
ifelse()
that took a lot of time but isn't needed when we only have onedestination
.This leads to a very large speedup: